Sequence molecules are DNA, RNA and protein molecules whose structures are determined by an underlying molecular sequence. They are instantiated from the `DNA`, `RNA` and `Protein` classes in the `bioseq` module. Note that each instance of these classes represents a single strand. For multi-stranded objects such as double-stranded DNA or DNA-RNA complexes, each strand must be instantiated separately.
Internally, the type hierarchies for `DNA`, `RNA` and `Protein` are:

- `Molecule -> SequenceMolecule -> Polynucleotide -> DNA, RNA`
- `Molecule -> SequenceMolecule -> Polypeptide -> Protein`
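
The two inheritance chains can be pictured with a minimal class skeleton. This is only an illustration: the class bodies are empty placeholders, not the actual `wc_rules` source, and the helper `lineage` is hypothetical. Only the class names and their order come from the text.

```python
# Illustrative skeleton only: empty placeholder classes that mirror the
# inheritance chains described in the text, not the real wc_rules classes.
class Molecule: ...
class SequenceMolecule(Molecule): ...
class Polynucleotide(SequenceMolecule): ...
class Polypeptide(SequenceMolecule): ...
class DNA(Polynucleotide): ...
class RNA(Polynucleotide): ...
class Protein(Polypeptide): ...

def lineage(cls):
    """Return class names from the root of the hierarchy down to cls."""
    return [c.__name__ for c in reversed(cls.__mro__) if c is not object]

print(' -> '.join(lineage(DNA)))
print(' -> '.join(lineage(Protein)))
```

Under this layout, `isinstance(obj, Polynucleotide)` separates DNA and RNA from proteins, which is why the polynucleotide-only methods can live on the shared parent class.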
All `SequenceMolecule` objects have a `sequence` attribute, which holds a reference to a `Bio.Seq.Seq` object from Biopython. During instantiation, set the `use_permissive_alphabet` argument to choose between a permissive alphabet (the default) and a strict one, e.g., `GATCRYWSMKHBVDN` vs. `GATC`.
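
To make the difference between the two alphabets concrete, the sketch below checks a string against each one in plain Python. The helper `conforms_to_alphabet` and the constant names are hypothetical and exist only for illustration; how `wc_rules` validates sequences internally is not shown here and may differ.

```python
# The two DNA alphabets quoted in the text; the constant names are illustrative.
STRICT_DNA = set('GATC')
PERMISSIVE_DNA = set('GATCRYWSMKHBVDN')  # adds ambiguity codes such as N and R

def conforms_to_alphabet(seq, permissive=True):
    """Return True if every character of seq belongs to the chosen alphabet."""
    alphabet = PERMISSIVE_DNA if permissive else STRICT_DNA
    return set(seq) <= alphabet

print(conforms_to_alphabet('GATTNACA', permissive=True))   # True
print(conforms_to_alphabet('GATTNACA', permissive=False))  # False: N is not in GATC
```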
Instance Attribute | Setter | Getter | Unsetter | Modifier |
---|---|---|---|---|
`id` | `set_id()` | `get_id()` | | |
`sites` | `add_sites(*sites)` | `get_sites(**kwargs)` | `remove_sites(*sites)` | |
`sequence` | `set_sequence(inputstr)` | `get_sequence(**kwargs)`, `get_sequence_length()` | | `replace_sequence(**kwargs)`, `delete_sequence(**kwargs)`, `insert_sequence(**kwargs)` |
The method `get_sequence()` has the input signature `get_sequence(start=None, end=None, length=None, as_string=False)`.

Sequences are indexed like Python strings (zero-based, with an exclusive end), and a subsequence can be located by either a `(start, end)` coordinate or a `(start, length)` coordinate. If both `end` and `length` are provided, `length` is ignored. `as_string` indicates whether the output is a plain Python string or a `Bio.Seq.Seq` object (the default).
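
The coordinate rules can be sketched in plain Python on an ordinary string. The functions `resolve_span` and `get_subsequence` are hypothetical stand-ins, not part of `wc_rules`. Treating a missing `start` as 0 and a missing `end` as the end of the sequence is an assumption borrowed from Python slicing.

```python
def resolve_span(seq_len, start=None, end=None, length=None):
    """Turn (start, end) or (start, length) into a Python-style (start, end) pair.

    Assumption: a missing start means 0 and a missing end means seq_len.
    If both end and length are given, end wins and length is ignored.
    """
    if start is None:
        start = 0
    if end is None:
        end = start + length if length is not None else seq_len
    return start, end

def get_subsequence(seq, start=None, end=None, length=None):
    """Slice seq using the resolved coordinates."""
    s, e = resolve_span(len(seq), start, end, length)
    return seq[s:e]

seq = 'GATTACAGATTACA'
print(get_subsequence(seq, start=7, end=11))            # GATT
print(get_subsequence(seq, start=7, length=4))          # GATT
print(get_subsequence(seq, start=7, end=11, length=2))  # GATT, length ignored
```

Because the end coordinate is exclusive, `(start=7, end=11)` and `(start=7, length=4)` name the same four positions.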
In [1]:
# create a DNA molecule with a particular sequence
from wc_rules.bioseq import DNA, RNA, Protein
inputstr = 'TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTGAAAAACCTTGCAGTTGATTTTAAAGCGTATAGAAGACAATACAGA'
dna1 = DNA(use_permissive_alphabet=False).set_sequence(inputstr)
dna1.get_sequence()
Out[1]:
In [2]:
# Get entire sequence
dna1.get_sequence()
Out[2]:
In [3]:
# Get a subsequence using (start,end)
dna1.get_sequence(start=90,end=99)
Out[3]:
In [4]:
# Get a subsequence using (start,length)
dna1.get_sequence(start=90,length=9)
Out[4]:
In [5]:
# Get a subsequence by unpacking a dict
loc = dict(start=90,end=99)
dna1.get_sequence(**loc)
Out[5]:
In [6]:
# Output as string
dna1.get_sequence(start=90,end=99,as_string=True)
Out[6]:
In [7]:
# Get sequence length
dna1.get_sequence_length()
Out[7]:
In [8]:
# Get subsequence length, only (start,end) allowed
dna1.get_sequence_length(start=90,end=99)
Out[8]:
`Polynucleotide` objects (i.e., DNA and RNA) have the following additional methods that read the molecular sequence, perform alphabet conversion, and return a sequence object (`Bio.Seq.Seq`):

- `get_DNA(**kwargs)`, returns a DNA sequence
- `get_RNA(**kwargs)`, returns an RNA sequence
- `get_protein(**kwargs)`, returns a protein sequence

The following kwargs are common to all three methods: `start=None`, `end=None`, `length=None`, `as_string=False`, and `option`, which takes one of `'coding'`, `'complementary'` or `'reverse_complementary'`.

The `start`, `end` and `length` kwargs behave exactly as they do for `get_sequence()`.

The `option` kwarg indicates how the sequence is processed:

- `option='coding'` calls `get_sequence()`, then performs alphabet conversion (default),
- `option='complementary'` calls `get_sequence()`, converts to the complement, then performs alphabet conversion,
- `option='reverse_complementary'` calls `get_sequence()`, converts to the reverse complement, then performs alphabet conversion.

The `get_protein()` method has additional kwargs `table=1` and `to_stop=False`, which follow the recipe for the Biopython method `translate()`.
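
The three `option` values can be illustrated on a plain string over the strict `GATC` alphabet. This is a sketch only: `process_strand` and `to_rna` are hypothetical helpers, and they return Python strings, whereas the real methods return `Bio.Seq.Seq` objects by default.

```python
# Base-pairing map for the strict DNA alphabet: G<->C, A<->T.
_COMPLEMENT = str.maketrans('GATC', 'CTAG')

def process_strand(seq, option='coding'):
    """Apply the strand option to a DNA string before alphabet conversion."""
    if option == 'coding':
        return seq
    if option == 'complementary':
        return seq.translate(_COMPLEMENT)
    if option == 'reverse_complementary':
        return seq.translate(_COMPLEMENT)[::-1]
    raise ValueError(f'unknown option: {option!r}')

def to_rna(seq, option='coding'):
    """Process the strand, then convert the DNA alphabet to RNA (T -> U)."""
    return process_strand(seq, option).replace('T', 'U')

print(to_rna('GATTACA'))                                  # GAUUACA
print(to_rna('GATTACA', option='complementary'))          # CUAAUGU
print(to_rna('GATTACA', option='reverse_complementary'))  # UGUAAUC
```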
In [9]:
inputstr = 'TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTGAAAAACCTTGCAGTTGATTTTAAAGCGTATAGAAGACAATACAGA'
dna1 = DNA(use_permissive_alphabet=False).set_sequence(inputstr)
dna1.get_sequence()
Out[9]:
In [10]:
# Converting reverse complement to RNA, then initializing an RNA molecule
seq1 = dna1.get_rna(option='reverse_complementary')
rna1 = RNA(use_permissive_alphabet=False).set_sequence(seq1)
rna1.get_sequence()
Out[10]:
In [11]:
# Converting coding sequence to protein, then initializing a protein molecule
seq1 = dna1.get_protein()
prot1 = Protein(use_permissive_alphabet=False).set_sequence(seq1)
prot1.get_sequence()
Out[11]:
In [12]:
# Converting only a subset of the coding sequence to protein
seq1 = dna1.get_protein(start=66,end=99)
prot1 = Protein(use_permissive_alphabet=False).set_sequence(seq1)
prot1.get_sequence()
Out[12]:
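
The `to_stop` kwarg of `get_protein()` follows Biopython's `translate()`: with `to_stop=False` stop codons appear as `*`, and with `to_stop=True` translation ends at the first in-frame stop codon, which is not included in the output. The sketch below reproduces that behaviour in plain Python for the standard genetic code (NCBI table 1). The function `translate_dna` is hypothetical, handles only the strict `GATC` alphabet, and silently ignores trailing bases that do not form a full codon, which is a simplification.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_dna(seq, to_stop=False):
    """Translate a DNA string codon by codon from position 0."""
    protein = []
    # Step through complete codons only; leftover bases are skipped.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == '*' and to_stop:
            break
        protein.append(aa)
    return ''.join(protein)

print(translate_dna('ATGGCCTAAGGA'))                # MA*G
print(translate_dna('ATGGCCTAAGGA', to_stop=True))  # MA
```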